A systematic study of parameter correlations in large scale duplicate document detection
Abstract
Although much work has been done on duplicate document detection (DDD) and its applications, we observe the absence of a systematic study of the performance and scalability of large-scale DDD. It is still unclear how the various parameters of DDD, such as similarity threshold, precision/recall requirement, sampling ratio, and document size, correlate with one another. In this paper, we study the correlations among several of the most important DDD parameters. The impact of the sampling ratio is of particular interest because it heavily affects both the accuracy and the scalability of DDD algorithms. An empirical analysis is conducted on a million documents from the TREC .GOV collection. Experimental results show that, even with the same sampling ratio, the precision of DDD varies greatly across documents of different sizes. Based on this observation, an adaptive sampling strategy for DDD is proposed, which minimizes the sampling ratio subject to a given precision threshold. We believe the insights from our analysis will help guide future large-scale DDD work.
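The abstract does not spell out how the adaptive sampling strategy works, so the following is only a minimal sketch of one plausible reading: for each document-size bucket, pick the smallest sampling ratio whose measured precision still meets the required threshold. The bucket names, the candidate ratios, the precision values, and the function names `min_sampling_ratio` and `adaptive_ratios` are all hypothetical. The numbers are illustrative placeholders, not results from the paper, and the direction of the size effect they show is an assumption.

```python
from typing import Dict, Optional

# Hypothetical measurements: size bucket -> {sampling ratio -> precision}.
# Placeholder values for illustration only; they are NOT taken from the paper.
PRECISION_BY_BUCKET: Dict[str, Dict[float, float]] = {
    "small": {0.125: 0.70, 0.25: 0.85, 0.5: 0.95, 1.0: 1.00},
    "large": {0.125: 0.92, 0.25: 0.97, 0.5: 0.99, 1.0: 1.00},
}


def min_sampling_ratio(curve: Dict[float, float],
                       min_precision: float) -> Optional[float]:
    """Return the smallest ratio whose precision meets the threshold.

    Returns None when no candidate ratio satisfies the constraint.
    """
    feasible = [ratio for ratio, precision in curve.items()
                if precision >= min_precision]
    return min(feasible) if feasible else None


def adaptive_ratios(table: Dict[str, Dict[float, float]],
                    min_precision: float) -> Dict[str, Optional[float]]:
    """Choose one sampling ratio per size bucket under a precision constraint."""
    return {bucket: min_sampling_ratio(curve, min_precision)
            for bucket, curve in table.items()}


if __name__ == "__main__":
    # With the placeholder table and a 0.9 precision threshold.
    print(adaptive_ratios(PRECISION_BY_BUCKET, 0.9))
```

With these placeholder values and a precision threshold of 0.9, the sketch selects a ratio of 0.5 for the "small" bucket and 0.125 for the "large" bucket, which shows how a per-bucket choice can sample less overall than a single global ratio that has to satisfy the worst bucket.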
Similar resources
A systematic study on parameter correlations in large scale duplicate document detection
Although much work has been done on duplicate document detection (DDD) and its applications, we observe the absence of a systematic study on the performance and scalability of large-scale DDD algorithms. It is still unclear how various parameters in DDD correlate mutually, such as similarity threshold, precision/recall requirement, sampling ratio, and document size. This paper explores the corr...
A TWO-STAGE DAMAGE DETECTION METHOD FOR LARGE-SCALE STRUCTURES BY KINETIC AND MODAL STRAIN ENERGIES USING HEURISTIC PARTICLE SWARM OPTIMIZATION
In this study, an approach for damage detection of large-scale structures is developed by employing kinetic and modal strain energies and also Heuristic Particle Swarm Optimization (HPSO) algorithm. Kinetic strain energy is employed to determine the location of structural damages. After determining the suspected damage locations, the severity of damages is obtained based on variations of modal ...
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One of these errors is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There should be only one of them in a data source, but for some reasons like aggregation of ...
Large Scale Parallel Document Mining for Machine Translation
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion ...
A TWO-STAGE METHOD FOR DAMAGE DETECTION OF LARGE-SCALE STRUCTURES
A novel two-stage algorithm for detection of damages in large-scale structures under static loads is presented. The technique utilizes the vector of response change (VRC) and sensitivities of responses with respect to the elemental damage parameters (RSEs). It is shown that VRC approximately lies in the subspace spanned by RSEs corresponding to the damaged elements. The property is leveraged in...